COVID-19 US Cases Analysis

Authors: Dang Le and Vasu Dadhania
SOEN 6111 - Winter 2021
Presented to Dr. Tristan Glatard

In this notebook, we present the steps and results used to analyze the COVID-19 cases dataset from The New York Times. This dataset is updated daily on GitHub and therefore represents an accurate count of confirmed cases in US counties over time.

Preliminary Data Analysis

First we would like to introduce the dataset, which can be obtained from https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv.
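The loading step can be sketched as follows. For illustration, we parse a tiny synthetic sample in the same schema (date, county, state, fips, cases, deaths); in the notebook, the full CSV is read directly from the URL above.

```python
import io
import pandas as pd

# Illustrative sample in the same schema as the NYT us-counties.csv;
# the real notebook replaces the StringIO with the GitHub URL.
csv_text = """date,county,state,fips,cases,deaths
2020-01-21,Snohomish,Washington,53061,1,0
2020-01-22,Snohomish,Washington,53061,1,0
2020-01-24,Cook,Illinois,17031,1,0
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
print(df.dtypes)
print(df.shape)  # (rows, 6 columns)
```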

As shown above, the data we are using has 6 columns:

There are over 1 million rows of data, with 1,930 counties reporting COVID-19 cases over a period of 441 days. Furthermore, there are over three thousand FIPS codes, almost twice as many as distinct county names, since the same county name can appear in multiple states while each FIPS code identifies a unique county.

Because the given data is already sorted by date, we can see that Washington was the first state to report a confirmed case, which is quite interesting. Furthermore, the data is very sparse in the early part of the time interval, making clustering difficult because there is not enough information. We therefore decided to transform the data into matrix form, where each row represents a FIPS code, each column represents a date, and each value is the number of confirmed cases for that county on that day.

To start this transformation, we first observe that we only need 3 columns, date, fips and cases, because the state and county can be deduced from any given FIPS code. Also, the confirmed deaths column contains many zeros, so we decided to drop it.
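A minimal sketch of this pivot, using a toy long-format frame in place of the full dataset (in the real data, county/date combinations with no report come out as NaN and would need to be filled or dropped):

```python
import pandas as pd

# Toy long-format data with only the three columns we keep:
# (date, fips, cases).
long_df = pd.DataFrame({
    "date": ["2021-03-01", "2021-03-02", "2021-03-01", "2021-03-02"],
    "fips": [53061, 53061, 17031, 17031],
    "cases": [10, 12, 5, 7],
})

# Rows = FIPS codes, columns = dates, values = confirmed cases.
matrix = long_df.pivot(index="fips", columns="date", values="cases")
print(matrix)
```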

After grouping the data by FIPS code, we calculate the mean number of cases reported and call the describe method to obtain summary statistics. The count of 3,218 equals the number of FIPS codes found above, confirming that the group-by is correct.
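The group-by step can be sketched like this, with a toy frame standing in for the full dataset:

```python
import pandas as pd

# Toy data with three FIPS codes; the notebook runs this on the full dataset.
df = pd.DataFrame({
    "fips": [1, 1, 2, 2, 3],
    "cases": [10, 20, 100, 200, 5],
})

# Mean reported cases per FIPS code, then summary statistics.
mean_cases = df.groupby("fips")["cases"].mean()
stats = mean_cases.describe()
print(stats)  # "count" equals the number of distinct FIPS codes
```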

Looking at the mean and standard deviation, we can see that the data is quite spread out, which indicates that clustering would be useful here to identify and group similar data points. Furthermore, the distribution reveals anomalies: some groups have a mean of only 1 case, while others have over 300,000 cases.

Using the histogram, we can see that the majority of counties have a mean of fewer than 75,000 COVID-19 cases. However, there are regions with significantly more daily cases, and we suspect these regions are highly populated; we will only be able to confirm this hypothesis after clustering. Also note that the y-axis is converted to log scale for a cleaner representation.
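The log-scale histogram can be produced along these lines (synthetic log-normal values stand in for the per-county means):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

# Synthetic heavy-tailed values standing in for per-county mean cases.
mean_cases = np.random.default_rng(0).lognormal(mean=3, sigma=2, size=1000)

fig, ax = plt.subplots()
ax.hist(mean_cases, bins=50)
ax.set_yscale("log")  # log scale for a cleaner representation
ax.set_xlabel("Mean COVID-19 cases per county")
ax.set_ylabel("Number of counties (log scale)")
fig.savefig("hist.png")
```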

Clustering

In this section, we discuss our implementation of K-Means and Hierarchical Clustering on the dataset presented above.
Before proceeding with the clustering techniques, however, we need to identify how many clusters are suitable for our dataset. To do so, we pick a sample of the data and run k-means clustering on it with k ranging from 1 to 30 to find the best value.

Using the elbow graph presented above, we can see that the best number of clusters lies between 6 and 8, so we picked 7 as the number of clusters for both experiments.
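The elbow sweep can be sketched as follows; synthetic blobs stand in for the sampled county time series, and the range is shortened from the 1 to 30 sweep for brevity:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs standing in for the sampled data.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(40, 5)) for c in (0, 5, 10)])

# Sweep k and record the within-cluster sum of squares (inertia);
# the "elbow" in this curve suggests the best k.
inertias = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

print(inertias)
```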

K-Means Clustering

We start our experiments with k-means clustering, as it is the faster of the two methods. Because some data is missing during the first couple of months of the pandemic, we limit our experiments to the last 100 days of data.

With the k-means clustering algorithm, we are able to calculate the Silhouette and Davies-Bouldin scores, which confirm that the algorithm is working correctly. However, it would also be useful to see the result plotted onto a US map.
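A sketch of this step with scikit-learn, using synthetic blobs in place of the (county × last-100-days) case matrix:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

rng = np.random.default_rng(42)
# Synthetic blobs standing in for the per-county case matrix.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 10)) for c in (0, 4, 8)])

# Fit k-means (k = 3 here for the toy data; the notebook uses k = 7)
# and score the resulting partition.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
sil = silhouette_score(X, labels)
db = davies_bouldin_score(X, labels)
print(f"Silhouette: {sil:.4f}, Davies-Bouldin: {db:.4f}")
```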

Hierarchical Clustering

Similar to the k-means clustering above, we repeat the same experiment using hierarchical clustering.
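A corresponding sketch with scikit-learn's AgglomerativeClustering; Ward linkage is assumed here as one common choice, since the notebook does not state which linkage was used:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(1)
# Same kind of synthetic blobs as in the k-means sketch.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 10)) for c in (0, 4, 8)])

# Agglomerative (hierarchical) clustering with the same number of clusters.
labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)
print(np.bincount(labels))  # cluster sizes
```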

We then plot the result onto a US map.

Analysis

First, we would like to compare the effectiveness of the two clustering techniques using the Silhouette and Davies-Bouldin scores.

Silhouette Score

According to the scikit-learn documentation, the Silhouette score is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The best value is 1 and the worst value is -1, with values near 0 indicating overlapping clusters.
Looking at our results above, k-means appears to be the better algorithm, with a Silhouette score of 0.7651, compared to 0.7468 for hierarchical clustering.
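For a single sample, this definition reduces to a one-liner:

```python
def silhouette_sample(a: float, b: float) -> float:
    """Silhouette for one sample: a = mean intra-cluster distance,
    b = mean nearest-cluster distance. Range is [-1, 1]."""
    return (b - a) / max(a, b)

print(silhouette_sample(1.0, 4.0))  # 0.75 -> well clustered
print(silhouette_sample(4.0, 1.0))  # -0.75 -> likely mis-assigned
print(silhouette_sample(2.0, 2.0))  # 0.0 -> overlapping clusters
```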

Davies-Bouldin Score

Next, we analyze the two results using the Davies-Bouldin score, which measures the average similarity of each cluster with its most similar cluster, where similarity is the ratio of within-cluster distances to between-cluster distances.
The minimum score is 0, with lower values indicating better clustering. The score we obtain for k-means is 0.4032, while the score for hierarchical clustering is 0.4329. This result again confirms that k-means is the better algorithm of the two.
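A small illustration of how the score behaves on synthetic blobs: labels that match the blobs score far lower (better) than labels that mix them.

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(7)
# Two well-separated blobs of 30 points each.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2)) for c in (0, 5)])
good = np.repeat([0, 1], 30)  # labels matching the blobs
bad = np.tile([0, 1], 30)     # labels that deliberately mix the two blobs

print(davies_bouldin_score(X, good))  # low: compact, well-separated clusters
print(davies_bouldin_score(X, bad))   # much higher: clusters overlap heavily
```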

Because the given data is not extremely large, comparing the run time of the two algorithms is not meaningful, so we leave this for future work with a dataset whose size is better suited for performance comparison.
Next, we would like to point out some interesting facts from our result above.

Clustering Result

First of all, the results appear to be very similar, which confirms that even though k-means is the more effective algorithm, hierarchical clustering is still correct and produces reliable results.

Next, we can see that highly populated areas such as Los Angeles, New York and Miami are grouped into the same cluster, which confirms our hypothesis that cities with high populations are more likely to have similar COVID-19 progression. However, to further validate our hypothesis and see which clusters can be classified as high risk versus low risk, we randomly take one sample from each cluster and graph its progression over time.
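The per-cluster sampling and plotting step can be sketched as follows, with a random toy case matrix and labels standing in for the real ones:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Toy case matrix (rows = counties, cols = days) and cluster labels;
# cumsum makes each row non-decreasing, like cumulative case counts.
X = rng.random((20, 100)).cumsum(axis=1)
labels = rng.integers(0, 7, size=20)

fig, ax = plt.subplots()
for c in np.unique(labels):
    # Pick one random member of each cluster and plot its time series.
    idx = rng.choice(np.flatnonzero(labels == c))
    ax.plot(X[idx], label=f"cluster {c}")
ax.set_xlabel("day")
ax.set_ylabel("cases")
ax.legend()
fig.savefig("progression.png")
```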

After plotting the case progression over time, we can see that cluster 1 has the highest number of daily COVID-19 cases, followed by clusters 4 and 2. Cluster 3 can be considered medium risk, whereas clusters 0, 5 and 6 are relatively low risk.

Conclusion

Overall, this project has successfully demonstrated the capability and versatility of clustering techniques in analyzing the COVID-19 dataset provided by The New York Times. We were able to identify regions with high, medium and low risk, which could inform a vaccination plan in which high-risk areas are prioritized, followed by medium- and low-risk regions.

Furthermore, as future work, it would be interesting to repeat the same experiment over different time intervals to identify regions that move from one cluster to another. This would reveal states/counties with good preventive measures in place and therefore allow stakeholders to make an effective prevention plan.
We could also cluster regions based on the number of confirmed deaths and see whether there are regions that belong to a high-risk cluster in terms of daily cases but a lower-risk cluster in terms of deaths. This would suggest that they have an effective treatment plan for infected patients.